Student Information

Name: Samuel Perez

Student ID: 107065434

GitHub ID: perezsam


Instructions

  1. First: do the take home exercises in the DM19-Lab1-Master Repo. You may need to copy some cells from the Lab notebook to this notebook. This part is worth 20% of your grade.
  1. Second: follow the same process from the DM19-Lab1-Master Repo on the new dataset. You don't need to explain all the details as we did (some minimal comments explaining your code are useful though). This part is worth 30% of your grade.
    • Download the new dataset. The dataset contains a sentence and a score label. Read the specifications of the dataset for details.
    • You are allowed to use and modify the helper functions in the folder of the first lab session (notice they may need modification) or create your own.
  1. Third: please attempt the following tasks on the new dataset. This part is worth 30% of your grade.
    • Generate meaningful new data visualizations. Refer to online resources and the Data Mining textbook for inspiration and ideas.
    • Generate TF-IDF features from the tokens of each text. This will generate a document matrix; however, the weights will be computed differently (using the TF-IDF value of each word per document as opposed to the word frequency). Refer to this Scikit-learn guide.
    • Implement a simple Naive Bayes classifier that automatically classifies the records into their categories. Use both the TF-IDF features and word frequency features to build two separate classifiers. Comment on the differences. Refer to this article.
  1. Fourth: In the lab, we applied each step really quickly just to illustrate how to work with your dataset. Some steps are not ideal or the most efficient/meaningful, and each dataset can be handled differently as well. What are the inefficient parts you noticed? How can you improve the data preprocessing for these specific datasets? This part is worth 10% of your grade.
  1. Fifth: It's hard for us to follow if your code is messy :'(, so please tidy up your notebook and add minimal comments where needed. This part is worth 10% of your grade.

You can submit your homework following these guidelines: Git Intro & How to hand your homework. Make sure to commit and save your changes to your repository BEFORE the deadline (Oct. 29th 11:59 pm, Tuesday).

In [5]:
### Begin Assignment Here
In [ ]:
# necessary for when working with external scripts
%load_ext autoreload
%autoreload 2

1. Take Home Exercises

We start by setting up our libraries and helpers, and preparing the dataset for the take-home exercises.

GitHub does not render the newer version of the plotly library well; please refer to Rendered Homework 1 if graphs are missing.

In [4]:
# import libraries
import pandas as pd
import numpy as np
import nltk
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import plotly as py
import math
%matplotlib inline

from sklearn import preprocessing, metrics, decomposition, pipeline, dummy

# data visualization libraries
import matplotlib.pyplot as plt
from plotly import tools
import seaborn as sns
from mpl_toolkits import mplot3d
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Dimensionality Reduction
from sklearn.decomposition import PCA

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats.stats import pearsonr
from sklearn.naive_bayes import MultinomialNB

#--- Take home exercises setup ---#
# prepare dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

# my functions
import helpers.data_mining_helpers as dmh

# construct dataframe from a list
X = pd.DataFrame.from_records(dmh.format_rows(twenty_train), columns= ['text'])

# add category to the dataframe
X['category'] = twenty_train.target

# add category label also
X['category_name'] = X.category.apply(lambda t: dmh.format_labels(t, twenty_train))

Exercise 2:

Experiment with other querying techniques using pandas dataframes.

In [6]:
# query the first 15 records where category is either 1, 2 or 3
X.query('(category == [1, 2, 3])')[0:15] 
Out[6]:
text category category_name
0 From: sd345@city.ac.uk (Michael Collier) Subje... 1 comp.graphics
1 From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... 1 comp.graphics
2 From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... 3 soc.religion.christian
3 From: s0612596@let.rug.nl (M.M. Zwart) Subject... 3 soc.religion.christian
4 From: stanly@grok11.columbiasc.ncr.com (stanly... 3 soc.religion.christian
5 From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... 3 soc.religion.christian
6 From: jodfishe@silver.ucs.indiana.edu (joseph ... 3 soc.religion.christian
7 From: aldridge@netcom.com (Jacquelin Aldridge)... 2 sci.med
8 From: geb@cs.pitt.edu (Gordon Banks) Subject: ... 2 sci.med
9 From: libman@hsc.usc.edu (Marlena Libman) Subj... 2 sci.med
10 From: anasaz!karl@anasazi.com (Karl Dussik) Su... 3 soc.religion.christian
11 From: amjad@eng.umd.edu (Amjad A Soomro) Subje... 1 comp.graphics
14 From: sloan@cis.uab.edu (Kenneth Sloan) Subjec... 1 comp.graphics
15 From: Mike_Peredo@mindlink.bc.ca (Mike Peredo)... 1 comp.graphics
16 From: texx@ossi.com (Robert "Texx" Woodworth) ... 2 sci.med

Exercise 5:

Please check the data and the process below, and describe what you observe and why it happened. $Hint$: why didn't .isnull() work?

In [7]:
NA_dict = [{ 'id': 'A', 'missing_example': np.nan },
           { 'id': 'B'                            },
           { 'id': 'C', 'missing_example': 'NaN'  },
           { 'id': 'D', 'missing_example': 'None' },
           { 'id': 'E', 'missing_example':  None  },
           { 'id': 'F', 'missing_example': ''     }]

NA_df = pd.DataFrame(NA_dict, columns = ['id','missing_example'])
NA_df
Out[7]:
id missing_example
0 A NaN
1 B NaN
2 C NaN
3 D None
4 E None
5 F
In [8]:
NA_df['missing_example'].isnull()
Out[8]:
0     True
1     True
2    False
3    False
4     True
5    False
Name: missing_example, dtype: bool
In [ ]:
# Answer here
'''
The DataFrame.isnull() method returns True only when a value is actually missing:
NA values, such as None or numpy.NaN, map to True, and everything else maps to False.

In this case the explicitly declared strings 'NaN' and 'None', and the empty string '',
evaluate to False because the method has no way of knowing what the string contents represent.
'''
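A minimal sketch of one way to treat those string stand-ins as real missing values (assuming we know which literal strings to map) is to replace them with NaN before calling .isnull():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'missing_example': [np.nan, 'NaN', 'None', '', None]})
# map the literal strings 'NaN'/'None' and the empty string to real NaN first
cleaned = df['missing_example'].replace(['NaN', 'None', ''], np.nan)
print(cleaned.isnull().tolist())  # all five entries now register as missing
```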

Exercise 6:

In [10]:
X_sample = X.sample(n=1000) # no random_state, so a different sample is drawn on each run
In [11]:
len(X_sample)
Out[11]:
1000
In [12]:
X_sample[0:4]
Out[12]:
text category category_name
44 From: rgasch@nl.oracle.com (Robert Gasch) Subj... 2 sci.med
1454 From: anello@adcs00.fnal.gov (Anthony Anello) ... 2 sci.med
1643 From: ezzie@lucs2.lancs.ac.uk (One of those da... 1 comp.graphics
1958 From: dhawk@netcom.com (David Hawkins) Subject... 3 soc.religion.christian

Notice any changes to the X dataframe? What are they? Report every change you noticed as compared to the previous state of X. Feel free to query and look more closely at the dataframe for these changes.

In [ ]:
# Answer here

'''
The original DataFrame X is not affected by the DataFrame.sample() method.
The method only creates a copy of randomly selected rows, which is then assigned to X_sample.

Since the rows are drawn randomly, the sample is not sorted by index. If we want
ascending order we can call X_sample.sort_index().
'''
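Both points can be verified with a small sketch on a toy frame (standing in for X):

```python
import pandas as pd

df = pd.DataFrame({'v': range(5)})
s = df.sample(n=3, random_state=0)

assert len(df) == 5  # the original frame is untouched by sample()
# sort_index restores ascending index order in the sample
print(s.sort_index().index.is_monotonic_increasing)  # True
```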

Exercise 8:

We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise.

In [13]:
# Answer here
sample_counts = X_sample.category_name.value_counts()
actual_counts = X.category_name.value_counts()

combined_data_frame = pd.DataFrame({'dataset': actual_counts,
                    'sample': sample_counts}, index = categories)

print(combined_data_frame.plot.bar(title = 'Category Distribution', rot = 0, fontsize = 12, figsize = (8,4)))
AxesSubplot(0.125,0.125;0.775x0.755)

Exercise 10:

We said that the 1 at the beginning of the fifth record represents the 00 term. Notice that there is another 1 in the same record. Can you provide code that verifies which word this 1 represents in the vocabulary? Try to do this as efficiently as possible.

In [14]:
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X.text)
In [15]:
# Answer here
array = X_counts[4:5, 0:100].toarray() # obtain the fifth record (first 100 vocabulary terms)

'''
We can print all the words the sentence contains among the first 100 vocabulary terms;
the second word printed corresponds to the second 1 in the array.
'''

for word in count_vect.inverse_transform(array)[0]:
    print('word: %s' % word)
word: 00
word: 01

Exercise 11:

In [16]:
# Answer here

'''
We can use a sample of the whole document set to create a smaller term-document matrix,
which we can plot to observe that some terms appear far more often than others.

We can also remove the vmax setting so that different values in the term-document matrix
get different colors, and drop the number labels inside the heatmap to make it less cluttered.
'''
n = 150
sample_X = X.sample(n=n, random_state = 26)
sample_count_vect = CountVectorizer()
sample_counts = sample_count_vect.fit_transform(sample_X.text)
plot_x = ["term_"+str(i) for i in sample_count_vect.get_feature_names()[0:n]]

# obtain document index
plot_y = ["doc_"+ str(i) for i in list(sample_X.index)[:n]]
plot_z = sample_counts[0:n, 0:n].toarray()
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(18, 14))
ax = sns.heatmap(df_todraw,
                 cmap="PuRd",
                 vmin=0, annot=False)

Exercise 12 :

Please try to reduce the dimension to 3, and plot the result using a 3-D plot. Use at least 3 different angles (camera positions) to check your result and describe what you found.

$Hint$: you can refer to Axes3D in the documentation.

In [17]:
# Answer here
X_reduced3 = PCA(n_components = 3).fit_transform(X_counts.toarray())
print('Dimension:')
print(X_reduced3.shape)

col = ['coral', 'blue', 'black', 'm']
fig = plt.figure(figsize = (25,10))

ax1 = fig.add_subplot(2,2,1, projection='3d')
ax2 = fig.add_subplot(2,2,2, projection='3d')
ax3 = fig.add_subplot(2,2,3, projection='3d')
ax4 = fig.add_subplot(2,2,4, projection='3d')

for c, category in zip(col, categories):
    xs = X_reduced3[X['category_name'] == category].T[0]
    ys = X_reduced3[X['category_name'] == category].T[1]
    zs = X_reduced3[X['category_name'] == category].T[2]
    
    ax1.scatter3D(xs, ys, zs, c= c, marker = 'o')
    ax2.scatter3D(xs, ys, zs, c= c, marker = 'o')
    ax3.scatter3D(xs, ys, zs, c= c, marker = 'o')
    ax4.scatter3D(xs, ys, zs, c= c, marker = 'o')

ax1.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax1.set_xlabel('\nX Label')
ax1.set_ylabel('\nY Label')
ax1.set_zlabel('\nZ Label')
ax1.view_init(0, 0)

ax2.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax2.set_xlabel('\nX Label')
ax2.set_ylabel('\nY Label')
ax2.set_zlabel('\nZ Label')
ax2.view_init(90, 0)

ax3.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax3.set_xlabel('\nX Label')
ax3.set_ylabel('\nY Label')
ax3.set_zlabel('\nZ Label')
ax3.view_init(0, 90)

ax4.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax4.set_xlabel('\nX Label')
ax4.set_ylabel('\nY Label')
ax4.set_zlabel('\nZ Label')
ax4.view_init(30, 45)

plt.show()
Dimension:
(2257, 3)

Observations: the data appears more spread out along the y and z axes and more compact along the x axis.
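The visual impression can also be checked numerically: PCA sorts its components by explained variance, so the variance ratios tell us which directions carry the most spread. A sketch on synthetic data (not the newsgroup matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.normal(size=(200, 3)) * [5.0, 2.0, 0.5]  # deliberately unequal spread per axis
ratios = PCA(n_components=3).fit(data).explained_variance_ratio_

# components come back ordered by variance, widest direction first
print(ratios.argmax())  # 0
```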

Exercise 13:

Interactive visualization of term frequencies

In [18]:
term_frequencies = np.asarray(X_counts.sum(axis=0))[0]
In [19]:
# Answer here
data = go.Bar(x = ["term_"+str(i) for i in count_vect.get_feature_names()[0:300]],
            y=term_frequencies[:300])

fig = go.Figure(data)

fig.update_layout(
    title=go.layout.Title(
        text="Term Frequencies",
        xref="paper",
        x=0
    )
)

fig.show()

Exercise 14:

Visualization of reduced number of terms

In [20]:
# Answer here
term_frequencies_df = pd.DataFrame({'terms': count_vect.get_feature_names(), 
                                            'counts': term_frequencies})
sample_term_frequencies_df = term_frequencies_df.sample(n=100, random_state=26)

sample_data = go.Bar(x = ["term_"+str(i) for i in sample_term_frequencies_df['terms']],
            y=sample_term_frequencies_df['counts'])

fig = go.Figure(sample_data)

fig.update_layout(
    title=go.layout.Title(
        text="Sample Data Terms",
        xref="paper",
        x=0
    )
)

fig.show()

Exercise 15:

Sort the terms on the x-axis by frequency instead of in alphabetical order.

In [21]:
# Answer here

# for efficiency we will use a sample of the dataset
#order the terms
ordered_term_frequencies_df = sample_term_frequencies_df.sort_values(by = 'counts', ascending = False)

#generate graph
ordered_data = go.Bar(x=["term_"+str(i) for i in ordered_term_frequencies_df['terms']],
            y=ordered_term_frequencies_df['counts'])

fig = go.Figure(ordered_data)

fig.update_layout(
    title=go.layout.Title(
        text="Long-tailed distribution in Sample",
        xref="paper",
        x=0
    )
)

fig.show()

Exercise 16:

Try to generate the binarization using the category_name column instead. Does it work?

In [23]:
# Answer here
mlb = preprocessing.LabelBinarizer()
mlb.fit(X.category)
X['bin_category'] = mlb.transform(X['category']).tolist()
X[:10]
Out[23]:
text category category_name bin_category
0 From: sd345@city.ac.uk (Michael Collier) Subje... 1 comp.graphics [0, 1, 0, 0]
1 From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... 1 comp.graphics [0, 1, 0, 0]
2 From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... 3 soc.religion.christian [0, 0, 0, 1]
3 From: s0612596@let.rug.nl (M.M. Zwart) Subject... 3 soc.religion.christian [0, 0, 0, 1]
4 From: stanly@grok11.columbiasc.ncr.com (stanly... 3 soc.religion.christian [0, 0, 0, 1]
5 From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... 3 soc.religion.christian [0, 0, 0, 1]
6 From: jodfishe@silver.ucs.indiana.edu (joseph ... 3 soc.religion.christian [0, 0, 0, 1]
7 From: aldridge@netcom.com (Jacquelin Aldridge)... 2 sci.med [0, 0, 1, 0]
8 From: geb@cs.pitt.edu (Gordon Banks) Subject: ... 2 sci.med [0, 0, 1, 0]
9 From: libman@hsc.usc.edu (Marlena Libman) Subj... 2 sci.med [0, 0, 1, 0]
In [24]:
mlb.fit(X.category_name)
X['bin_category'] = mlb.transform(X['category_name']).tolist()
X[:10]
Out[24]:
text category category_name bin_category
0 From: sd345@city.ac.uk (Michael Collier) Subje... 1 comp.graphics [0, 1, 0, 0]
1 From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... 1 comp.graphics [0, 1, 0, 0]
2 From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... 3 soc.religion.christian [0, 0, 0, 1]
3 From: s0612596@let.rug.nl (M.M. Zwart) Subject... 3 soc.religion.christian [0, 0, 0, 1]
4 From: stanly@grok11.columbiasc.ncr.com (stanly... 3 soc.religion.christian [0, 0, 0, 1]
5 From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... 3 soc.religion.christian [0, 0, 0, 1]
6 From: jodfishe@silver.ucs.indiana.edu (joseph ... 3 soc.religion.christian [0, 0, 0, 1]
7 From: aldridge@netcom.com (Jacquelin Aldridge)... 2 sci.med [0, 0, 1, 0]
8 From: geb@cs.pitt.edu (Gordon Banks) Subject: ... 2 sci.med [0, 0, 1, 0]
9 From: libman@hsc.usc.edu (Marlena Libman) Subj... 2 sci.med [0, 0, 1, 0]

Observation: binarizing the category_name column returns exactly the same result as binarizing the category column.
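The match is no accident: LabelBinarizer stores classes in sorted order, and fetch_20newsgroups assigns target ids in the same alphabetical order as the category names, so both fits produce identically ordered columns. A small sketch of the sorting behavior:

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(['sci.med', 'comp.graphics', 'sci.med', 'alt.atheism'])
# classes_ is sorted alphabetically regardless of the order seen during fit
print(list(lb.classes_))  # ['alt.atheism', 'comp.graphics', 'sci.med']
```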

2. Working with the New Data Set

In this section we perform different operations on the new dataset and also present the data visualizations from Task 3.

A. Create a Dictionary with the given Data Set

In [25]:
#load data into python array
sentiment_data_array = []
with open("sentiment_labelled_sentences/amazon_cells_labelled.txt","r") as amazon_data:
    sentiment_data_array += [string + '\tamazon' for string in amazon_data.read().split('\n')]
with open("sentiment_labelled_sentences/imdb_labelled.txt","r") as imdb_data:
    sentiment_data_array += [string + '\timdb' for string in imdb_data.read().split('\n')]
with open("sentiment_labelled_sentences/yelp_labelled.txt","r") as yelp_data:
    sentiment_data_array += [string + '\tyelp' for string in yelp_data.read().split('\n')]

#create dictionary with the array
sentiment_data = dmh.sentiment_data_dictionary(sentiment_data_array)
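The helper sentiment_data_dictionary is project-specific; a hypothetical sketch of the per-line parsing it presumably performs (the function name and dict keys here are illustrative assumptions, not the helper's actual API):

```python
def parse_labelled_line(line):
    """Split a 'sentence<TAB>score<TAB>source' line; return None for malformed rows."""
    parts = line.split('\t')
    if len(parts) != 3 or not parts[0] or not parts[1]:
        return None
    return {'sentence': parts[0], 'score': parts[1], 'source': parts[2]}

print(parse_labelled_line('Great for the jawbone.\t1\tamazon'))
print(parse_labelled_line(''))  # malformed rows are skipped
```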

B. Converting Dictionary into Pandas Dataframe

In [26]:
# construct dataframe from the created dictionary
sentiment_data_df = pd.DataFrame.from_records(data = {"sentence":sentiment_data['sentences'], "score":sentiment_data['scores'], "source":sentiment_data['sources']})

Print the first 10 records from the dataframe

In [27]:
# first 10 records from the dataframe
sentiment_data_df[:10]
Out[27]:
score sentence source
0 0 So there is no way for me to plug it in here i... amazon
1 1 Good case, Excellent value. amazon
2 1 Great for the jawbone. amazon
3 0 Tied to charger for conversations lasting more... amazon
4 1 The mic is great. amazon
5 0 I have to jiggle the plug to get it to line up... amazon
6 0 If you have several dozen or several hundred c... amazon
7 1 If you are Razr owner...you must have this! amazon
8 0 Needless to say, I wasted my money. amazon
9 0 What a waste of money and time!. amazon

Print the last 10 sentences

In [28]:
# last 10 records keeping only sentence and source column
sentiment_data_df[-10:][["sentence", "source"]]
Out[28]:
sentence source
2990 The refried beans that came with my meal were ... yelp
2991 Spend your money and time some place else. yelp
2992 A lady at the table next to us found a live gr... yelp
2993 the presentation of the food was awful. yelp
2994 I can't tell you how disappointed I was. yelp
2995 I think food should have flavor and texture an... yelp
2996 Appetite instantly gone. yelp
2997 Overall I was not impressed and would not go b... yelp
2998 The whole experience was underwhelming, and I ... yelp
2999 Then, as if I hadn't wasted enough of my life ... yelp

Query every 10th record, first 10 records are printed

In [29]:
# using iloc (by position)
# query every 10th record in our dataframe, keeping only the first 10 of those records
sentiment_data_df.iloc[::10, 0:2][0:10]
Out[29]:
score sentence
0 0 So there is no way for me to plug it in here i...
10 1 And the sound quality is great.
20 0 I went on Motorola's website and followed all ...
30 0 This is a simple little phone to use, but the ...
40 1 It has a great camera thats 2MP, and the pics ...
50 0 Not loud enough and doesn't turn on like it sh...
60 0 Essentially you can forget Microsoft's tech su...
70 0 Mic Doesn't work.
80 1 I wear it everyday and it holds up very well.
90 0 For a product that costs as much as this one d...

C. Data Exploration and Manipulation

Check for missing values

In [30]:
sentiment_data_df.isnull().apply(lambda x: dmh.check_missing_values(x))
Out[30]:
score       (The amoung of missing records is: , 0)
sentence    (The amoung of missing records is: , 0)
source      (The amoung of missing records is: , 0)
dtype: object

There are no missing values in the dataset. To guarantee this, the sentiment_data_dictionary method in the helpers file ignores any row with a missing sentence or a missing score.

In this particular case it is better to drop rows containing a missing value: a missing sentence cannot be reconstructed, and estimating a 0 or 1 score for it would only contaminate the data.
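Dropping rows with missing critical fields can also be done after the fact with pandas (a minimal sketch on toy data):

```python
import pandas as pd

df = pd.DataFrame({'sentence': ['Good case, Excellent value.', None],
                   'score': ['1', '0']})
# keep only rows where both sentence and score are present
print(len(df.dropna(subset=['sentence', 'score'])))  # 1
```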

Check for duplicated rows

In [31]:
"""
check if there are duplicated rows and remove them

"""
duplicates = sum(sentiment_data_df.duplicated('sentence'))

print('Number of rows before cleaning: %d' % len(sentiment_data_df))
print('Duplicated rows: %d' % duplicates)

if duplicates > 0:
    sentiment_data_df.drop_duplicates(keep=False, inplace=True)
    
# reset the index: after dropping duplicates the index keeps its original labels,
# so positional access with [] was returning the wrong values
sentiment_data_df.reset_index(drop=True, inplace=True)

print('Rows after dropping duplicates: %d' % len(sentiment_data_df))
Number of rows before cleaning: 3000
Duplicated rows: 17
Rows after dropping duplicates: 2966
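Note the arithmetic: with keep=False, 17 duplicated sentences remove both copies of each pair (3000 − 2·17 = 2966). A toy sketch of that behavior:

```python
import pandas as pd

df = pd.DataFrame({'s': ['a', 'b', 'a', 'c']})
# keep=False drops every copy of a duplicated value, not just the extras
print(len(df.drop_duplicates('s', keep=False)))  # 2 ('a' disappears entirely)
```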

Sampling

In [32]:
#sampling
records = 700
sentiment_data_sample = sentiment_data_df.sample(n=records, random_state=26)
In [33]:
#show the sampling and actual data counts in a bar graph
sample_counts = sentiment_data_sample.score.value_counts()
actual_counts = sentiment_data_df.score.value_counts()

combined_data_frame = pd.DataFrame({'data': actual_counts,
                    'sample': sample_counts})

print(combined_data_frame.plot.bar(title = 'Sentiment Distribution', rot = 0, fontsize = 12, figsize = (8,4), tick_label = ['negative', 'positive']))
AxesSubplot(0.125,0.125;0.775x0.755)

Pie chart visualization

In [34]:
labels = sentiment_data_df.source.value_counts().index
values = sentiment_data_df.source.value_counts()

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])

fig.update_layout(
    title=go.layout.Title(
        text="Sentences Source",
        xref="paper",
        x=0
    )
)

fig.show()

Scatter plot Visualization

In [35]:
#scatter plot visualization
#show the relation between the word count in each sentence and what sentiment it is attached to
sentiment_data_sample_2 = sentiment_data_df.sample(n = 100)
x_axis_array = ["sentence_" + str(index) for index in sentiment_data_sample_2.index]

fig = go.Figure()

# Add traces
fig.add_trace(go.Scatter(x=x_axis_array, y=[len(sentence.split(' ')) for sentence in sentiment_data_sample_2.sentence],
                    mode='lines+markers',
                    name='Sentence Word Count'))

fig.add_trace(go.Scatter(x=x_axis_array, y=[score for score in sentiment_data_sample_2.score],
                    mode='lines+markers',
                    name='Sentence Score'))

fig.update_layout(
    title=go.layout.Title(
        text="Scatter Plot",
        xref="paper",
        x=0
    )
)

fig.show()

positive_sentences_array = [len(row.sentence.split(' ')) for index, row in sentiment_data_sample_2.iterrows() if row.score == '1']
print('Average word count in each positive sentence: ', sum(positive_sentences_array)/len(positive_sentences_array))

negative_sentences_array = [len(row.sentence.split(' ')) for index, row in sentiment_data_sample_2.iterrows() if row.score == '0']
print('Average word count in each negative sentence: ', sum(negative_sentences_array)/len(negative_sentences_array))
Average word count in each positive sentence:  12.583333333333334
Average word count in each negative sentence:  11.875

Feature Creation

Data frame with unigrams:

In [36]:
sentiment_data_df['unigrams'] = sentiment_data_df['sentence'].apply(lambda x: dmh.tokenize_text(x))

Feature Subset Selection

Document-Term matrix:

In [37]:
count_vect = CountVectorizer()
frequency_counts = count_vect.fit_transform(sentiment_data_df.sentence)
print("Document Term Matrix Size:", frequency_counts.shape)
Document Term Matrix Size: (2966, 5157)

Heatmap visualization

In [38]:
# heatmap visualization
sample_count_vect = CountVectorizer()
sample_counts = sample_count_vect.fit_transform(sentiment_data_sample.sentence)
plot_x = ["term_"+str(i) for i in sample_count_vect.get_feature_names()[0:records]]
plot_y = ["doc_"+ str(i) for i in list(sentiment_data_sample.index)[:records]]
plot_z = sample_counts[0:records, 0:records].toarray()
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(18, 14))
ax = sns.heatmap(df_todraw,
                 cmap="PuRd",
                 vmin=0, annot=False)

Dimensionality Reduction

PCA with 2 components along with its scatter plot:

In [39]:
#2 dimension PCA
colors = ['coral', 'blue']
scores = ['0','1']

sentiment_data_reduced2 = PCA(n_components = 2).fit_transform(frequency_counts.toarray())

fig = plt.figure(figsize = (25,10))
ax = fig.subplots()

for c, score in zip(colors, scores):
    xs = sentiment_data_reduced2[sentiment_data_df['score'] == score].T[0]
    ys = sentiment_data_reduced2[sentiment_data_df['score'] == score].T[1]
   
    ax.scatter(xs, ys, c = c, marker='o')

ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')

plt.show()

PCA with 3 components along with its scatter plot:

In [40]:
#3 dimension PCA
sentiment_data_reduced3 = PCA(n_components = 3).fit_transform(frequency_counts.toarray())
fig = plt.figure(figsize = (25,10))

ax1 = fig.add_subplot(1,1,1, projection='3d')

for c, score in zip(colors, scores):
    xs = sentiment_data_reduced3[sentiment_data_df['score'] == score].T[0]
    ys = sentiment_data_reduced3[sentiment_data_df['score'] == score].T[1]
    zs = sentiment_data_reduced3[sentiment_data_df['score'] == score].T[2]
    
    ax1.scatter3D(xs, ys, zs, c= c, marker = 'o')

ax1.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax1.set_xlabel('\nX Label')
ax1.set_ylabel('\nY Label')
ax1.set_zlabel('\nZ Label')
ax1.view_init(25, 45)

plt.show()

Attribute Transformation/Aggregation

Creating an array that contains the total term frequency of each word. Bar graph visualization (zoom in for details).

In [41]:
#create a term frequencies array
term_frequencies = np.asarray(frequency_counts.sum(axis=0))[0]

data = go.Bar(x=["term_"+str(i) for i in count_vect.get_feature_names()],
            y=term_frequencies)

fig = go.Figure(data)

fig.update_layout(
    title=go.layout.Title(
        text="Term Frequencies",
        xref="paper",
        x=0
    )
)

fig.show()

Long-tailed distribution of term frequencies. Bar graph visualization, term frequencies in descending order (zoom for details).

In [42]:
#ordered term frequencies visualization
term_frequencies_df = pd.DataFrame({'terms': count_vect.get_feature_names(), 
                                            'counts': term_frequencies})
ordered_term_frequencies_df = term_frequencies_df.sort_values(by = 'counts', ascending = False)

ordered_data = go.Bar(x=["term_"+str(i) for i in ordered_term_frequencies_df['terms']],
            y=ordered_term_frequencies_df['counts'])

fig = go.Figure(ordered_data)

fig.update_layout(
    title=go.layout.Title(
        text="Ordered Term Frequencies",
        xref="paper",
        x=0
    )
)

fig.show()

Binarization

The score column has only 2 possible values, so it makes more sense to apply binarization to the source column.

In [43]:
mlb = preprocessing.LabelBinarizer()
mlb.fit(sentiment_data_df.source)
sentiment_data_df['bin_source'] = mlb.transform(sentiment_data_df['source']).tolist()
In [44]:
# print the first 9 rows
sentiment_data_df[0:9]
Out[44]:
score sentence source unigrams bin_source
0 0 So there is no way for me to plug it in here i... amazon [So, there, is, no, way, for, me, to, plug, it... [1, 0, 0]
1 1 Good case, Excellent value. amazon [Good, case, ,, Excellent, value, .] [1, 0, 0]
2 1 Great for the jawbone. amazon [Great, for, the, jawbone, .] [1, 0, 0]
3 0 Tied to charger for conversations lasting more... amazon [Tied, to, charger, for, conversations, lastin... [1, 0, 0]
4 1 The mic is great. amazon [The, mic, is, great, .] [1, 0, 0]
5 0 I have to jiggle the plug to get it to line up... amazon [I, have, to, jiggle, the, plug, to, get, it, ... [1, 0, 0]
6 0 If you have several dozen or several hundred c... amazon [If, you, have, several, dozen, or, several, h... [1, 0, 0]
7 1 If you are Razr owner...you must have this! amazon [If, you, are, Razr, owner, ..., you, must, ha... [1, 0, 0]
8 0 Needless to say, I wasted my money. amazon [Needless, to, say, ,, I, wasted, my, money, .] [1, 0, 0]

Print the last 10 rows

In [45]:
sentiment_data_df[-10:]
Out[45]:
score sentence source unigrams bin_source
2956 0 The refried beans that came with my meal were ... yelp [The, refried, beans, that, came, with, my, me... [0, 0, 1]
2957 0 Spend your money and time some place else. yelp [Spend, your, money, and, time, some, place, e... [0, 0, 1]
2958 0 A lady at the table next to us found a live gr... yelp [A, lady, at, the, table, next, to, us, found,... [0, 0, 1]
2959 0 the presentation of the food was awful. yelp [the, presentation, of, the, food, was, awful, .] [0, 0, 1]
2960 0 I can't tell you how disappointed I was. yelp [I, ca, n't, tell, you, how, disappointed, I, ... [0, 0, 1]
2961 0 I think food should have flavor and texture an... yelp [I, think, food, should, have, flavor, and, te... [0, 0, 1]
2962 0 Appetite instantly gone. yelp [Appetite, instantly, gone, .] [0, 0, 1]
2963 0 Overall I was not impressed and would not go b... yelp [Overall, I, was, not, impressed, and, would, ... [0, 0, 1]
2964 0 The whole experience was underwhelming, and I ... yelp [The, whole, experience, was, underwhelming, ,... [0, 0, 1]
2965 0 Then, as if I hadn't wasted enough of my life ... yelp [Then, ,, as, if, I, had, n't, wasted, enough,... [0, 0, 1]

3. TF-IDF Matrix

In [46]:
#TF-IDF
tf_idf_vect = TfidfVectorizer()
tf_idf_counts = tf_idf_vect.fit_transform(sentiment_data_df.sentence)

print('First 10 Feature Names:', tf_idf_vect.get_feature_names()[0:10])
print('TF-IDF Matrix Size:', tf_idf_counts.shape)
First 10 Feature Names: ['00', '10', '100', '11', '12', '13', '15', '15g', '15pm', '17']
TF-IDF Matrix Size: (2966, 5157)

TF-IDF vs term frequency comparison visualization

In [47]:
#visualize the 25 terms with the highest values in the TF-IDF Matrix and compare it with the highest values in the 
#Count Frequency Matrix
n = 25
term_tf_idf = np.asarray(tf_idf_counts.sum(axis=0))[0]
term_tf_idf_df = pd.DataFrame({'terms': tf_idf_vect.get_feature_names(), 
                                            'counts': term_tf_idf})
ordered_term_tf_idf_df = term_tf_idf_df.sort_values(by = 'counts', ascending = False)

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Bar(
            x=["term_"+str(i) for i in ordered_term_tf_idf_df['terms']][:n],
            y=ordered_term_tf_idf_df['counts'][:n],
            name = "TF-IDF"),            
            row=1, col=1
)


fig.add_trace(
    go.Bar(
            x=["term_"+str(i) for i in ordered_term_frequencies_df['terms'][:n]],
            y=ordered_term_frequencies_df['counts'][:n],
            name = "Word Counts"),            
            row=1, col=2
)

fig.update_layout(height=700, width=900, title_text="TF-IDF Ordered Data Terms vs Counts Ordered Data Terms")
fig.show()

Naive Bayes Classifier

In [48]:
#Naive Bayes classifier
#term frequency
mnb_term_frequency = MultinomialNB()
mnb_term_frequency.fit(frequency_counts, sentiment_data_df['score'].values)

#TF-IDF
mnb_tf_idf = MultinomialNB()
mnb_tf_idf.fit(tf_idf_counts, sentiment_data_df['score'].values)
Out[48]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Accuracy Testing

In [49]:
#testing the accuracy of the classifier
#test with a single random sentence from the data set
r_sentence = sentiment_data_df.sample(n = 1)
print('Sentence:',r_sentence.iloc[0].sentence,'\nReal Score:', r_sentence.iloc[0].score)
r_sentence_index = r_sentence.index[0]
print('Term Frequency MNB Prediction: ', mnb_term_frequency.predict(frequency_counts[r_sentence_index:r_sentence_index+1])[0])
print('TF-IDF MNB Prediction: ', mnb_tf_idf.predict(tf_idf_counts[r_sentence_index:r_sentence_index+1])[0])
Sentence: But, Kevin Spacey is an excellent, verbal tsunami as Buddy Ackerman – and totally believable because he is a great actor.   
Real Score: 1
Term Frequency MNB Prediction:  1
TF-IDF MNB Prediction:  1
In [50]:
#obtain the accuracy of both models comparing the predicted values with the actual scores of all the sentences in the data set
total_sentences = len(sentiment_data_df)
tf_correct_prediction = 0
tf_idf_correct_prediction = 0
for index, row in sentiment_data_df.iterrows():

    tf_prediction = mnb_term_frequency.predict(frequency_counts[index:index+1])[0]
    if row.score == tf_prediction:
        tf_correct_prediction+=1
    
    tf_idf_prediction = mnb_tf_idf.predict(tf_idf_counts[index:index+1])[0]
    if row.score == tf_idf_prediction:
        tf_idf_correct_prediction+=1
    
print('Term Frequency MNB Prediction Accuracy: %.2f%%' % ((tf_correct_prediction/total_sentences)*100))
print('TF-IDF MNB Prediction Accuracy: %.2f%%' % ((tf_idf_correct_prediction/total_sentences)*100))
Term Frequency MNB Prediction Accuracy: 94.50%
TF-IDF MNB Prediction Accuracy: 95.28%
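The row-by-row loop above can be replaced by a single vectorized call: `predict()` accepts the whole matrix at once, and `np.mean` over the comparison gives the accuracy directly. A minimal sketch with toy data (in the notebook you would pass `frequency_counts` / `tf_idf_counts` and `sentiment_data_df['score']` instead):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy stand-ins for the notebook's sentences and scores
sentences = ["great movie", "terrible food", "excellent service", "very bad"]
labels = np.array(['1', '0', '1', '0'])

counts = CountVectorizer().fit_transform(sentences)
model = MultinomialNB().fit(counts, labels)

# predict() on the whole matrix at once -- no Python loop needed
predictions = model.predict(counts)
accuracy = np.mean(predictions == labels)  # equivalently: model.score(counts, labels)
print('Training accuracy: %.2f%%' % (accuracy * 100))
```

For an honest estimate, `sklearn.model_selection.train_test_split` can hold out a test set before fitting, so the models are scored on sentences they have not seen.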
In [51]:
#testing with sentences that are not in the dataset
negative_sentence = "I was very disappointed with the service"
n_word_freq = count_vect.transform([negative_sentence]).toarray()
n_word_tf_idf = tf_idf_vect.transform([negative_sentence]).toarray()
print('Term Frequency MNB Prediction for negative sentence: ', mnb_term_frequency.predict(n_word_freq[0:1])[0])
print('TF-IDF MNB Prediction for negative sentence: ', mnb_tf_idf.predict(n_word_tf_idf[0:1])[0])

positive_sentence = "Highly recommended, the service is very good"
p_word_freq = count_vect.transform([positive_sentence]).toarray()
p_word_tf_idf = tf_idf_vect.transform([positive_sentence]).toarray()
print('Term Frequency MNB Prediction for positive sentence: ', mnb_term_frequency.predict(p_word_freq[0:1])[0])
print('TF-IDF MNB Prediction for positive sentence: ', mnb_tf_idf.predict(p_word_tf_idf[0:1])[0])
Term Frequency MNB Prediction for negative sentence:  0
TF-IDF MNB Prediction for negative sentence:  0
Term Frequency MNB Prediction for positive sentence:  1
TF-IDF MNB Prediction for positive sentence:  1
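Beyond the hard 0/1 label, `predict_proba` exposes how confident the model is in each class. A self-contained sketch with toy sentences (in the notebook, `count_vect` and `mnb_term_frequency` could be reused instead of the illustrative names here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training data; labels follow the dataset's '0'/'1' score convention
train = ["I was very disappointed", "the service is very good",
         "bad experience overall", "highly recommended place"]
labels = ['0', '1', '0', '1']

vect = CountVectorizer()
model = MultinomialNB().fit(vect.fit_transform(train), labels)

new = vect.transform(["very disappointed"])
label = model.predict(new)[0]
proba = model.predict_proba(new)[0]  # probabilities ordered by model.classes_
print('Prediction:', label)
print('Class probabilities:', dict(zip(model.classes_, proba)))
```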

4. Comments about the given code

In [52]:
# the following is not necessary
## we can achieve the same result with just print(twenty_train.data[0])
print("\n".join(twenty_train.data[0].split("\n"))) 
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.

In [53]:
# using a simple print()
print(twenty_train.data[0])
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.

In [ ]:
# in this piece of code the same value is computed twice: the loop's result is
# immediately overwritten by the vectorized one-liner, so the loop can be removed
for j in range(0,X_counts.shape[1]):
    term_frequencies.append(sum(X_counts[:,j].toarray()))
term_frequencies = np.asarray(X_counts.sum(axis=0))[0]
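A quick check on a toy sparse matrix confirms that the vectorized one-liner produces the same column sums as the loop (`X` here is a small stand-in for `X_counts`):

```python
import numpy as np
from scipy.sparse import csr_matrix

# small stand-in for the document-term matrix X_counts
X = csr_matrix(np.array([[1, 0, 2],
                         [0, 3, 1],
                         [2, 1, 0]]))

# loop version: one column sum per term
loop_frequencies = [X[:, j].toarray().sum() for j in range(X.shape[1])]

# vectorized version: a single sparse-matrix operation
vectorized_frequencies = np.asarray(X.sum(axis=0))[0]

print(loop_frequencies)
print(list(vectorized_frequencies))
```

The vectorized form also avoids densifying one column at a time, which matters when the vocabulary is large.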

Additional exploration

For further exploration, we also examined similarities between sentences of the same sentiment. Five random pairs of sentences sharing a sentiment are compared; restricting the pairs to the same sentiment increases the chance that they have words in common.

In [54]:
#distance similarity
for i in range(0,5):
    #show cosine similarity and pearson's correlation coefficient of 2 random documents with negative sentiment (score == 0)
    r_negative_sentences = sentiment_data_df[sentiment_data_df['score'] == '0'].sample(n = 2)
    print('Sentences:')
    for sentence in r_negative_sentences.sentence:
        print('-%s' % sentence)
    index1 = r_negative_sentences.index[0]
    index2 = r_negative_sentences.index[1]
    
    #obtain the rows in the count frequency matrix corresponding to those indexes
    count_row1 = frequency_counts[index1:index1+1]
    count_row2 = frequency_counts[index2:index2+1]

    #obtain the rows in the TF-IDF matrix corresponding to those indexes
    tf_idf_row1 = tf_idf_counts[index1:index1+1]
    tf_idf_row2 = tf_idf_counts[index2:index2+1]

    #cosine similarity
    print("Cosine Similarity using term count:",cosine_similarity(count_row1, count_row2)[0][0])
    print("Cosine Similarity using TF-IDF:",cosine_similarity(tf_idf_row1, tf_idf_row2)[0][0])
    #Pearson's correlation coefficient
    print("Pearson's correlation coefficient using term count:",pearsonr(count_row1.toarray().ravel(),count_row2.toarray().ravel())[0])
    print("Pearson's correlation coefficient using TF-IDF:",pearsonr(tf_idf_row1.toarray().ravel(),tf_idf_row2.toarray().ravel())[0])
    #Extended Jaccard coefficient
    print("Extended Jaccard Coefficient using term count:", dmh.extended_jaccard_coefficient(count_row1.toarray().ravel(), count_row2.toarray().ravel()))
    print("Extended Jaccard Coefficient using TF-IDF:", dmh.extended_jaccard_coefficient(tf_idf_row1.toarray().ravel(), tf_idf_row2.toarray().ravel()))
    print("\n")
Sentences:
-They dropped more than the ball.
-It lacked flavor, seemed undercooked, and dry.
Cosine Similarity using term count: 0.0
Cosine Similarity using TF-IDF: 0.0
Pearson's correlation coefficient using term count: -0.0012582740955141064
Pearson's correlation coefficient using TF-IDF: -0.0010978114792265785
Extended Jaccard Coefficient using term count: 0.0
Extended Jaccard Coefficient using TF-IDF: 0.0


Sentences:
-It had some average acting from the main person, and it was a low budget as you clearly can see.  
-Poor Reliability.
Cosine Similarity using term count: 0.0
Cosine Similarity using TF-IDF: 0.0
Pearson's correlation coefficient using term count: -0.0011391242680419962
Pearson's correlation coefficient using TF-IDF: -0.0010853774292904736
Extended Jaccard Coefficient using term count: 0.0
Extended Jaccard Coefficient using TF-IDF: 0.0


Sentences:
-I just don't know how this place managed to served the blandest food I have ever eaten when they are preparing Indian cuisine.
-I give it 2 thumbs down
Cosine Similarity using term count: 0.0
Cosine Similarity using TF-IDF: 0.0
Pearson's correlation coefficient using term count: -0.0017815460977836479
Pearson's correlation coefficient using TF-IDF: -0.0015889870996865277
Extended Jaccard Coefficient using term count: 0.0
Extended Jaccard Coefficient using TF-IDF: 0.0


Sentences:
-The character developments also lacked in depth.  
-It's very slow.  
Cosine Similarity using term count: 0.0
Cosine Similarity using TF-IDF: 0.0
Pearson's correlation coefficient using term count: -0.0008894751630917949
Pearson's correlation coefficient using TF-IDF: -0.0007737150898307318
Extended Jaccard Coefficient using term count: 0.0
Extended Jaccard Coefficient using TF-IDF: 0.0


Sentences:
-did not like at all.
-And, FINALLY, after all that, we get to an ending that would've been great had it been handled by competent people and not Jerry Falwell.  
Cosine Similarity using term count: 0.15811388300841894
Cosine Similarity using TF-IDF: 0.105686887744641
Pearson's correlation coefficient using term count: 0.15651762511415096
Pearson's correlation coefficient using TF-IDF: 0.10400299120879647
Extended Jaccard Coefficient using term count: 0.057142857142857134
Extended Jaccard Coefficient using TF-IDF: 0.055791667734807975


Most similarities are zero (or close to zero) because the sentences share very few words, even when they express the same sentiment.

If we instead compute the similarity coefficients for two sentences known to share at least one word, we observe non-zero values for both the term-count vectors and the TF-IDF vectors.

In [55]:
#known negative sentence indexes with common words
index1 = 1455
index2 = 1178
print('Sentences:')
print('-%s' % sentiment_data_df.iloc[index1].sentence)
print('-%s' % sentiment_data_df.iloc[index2].sentence)
#obtain the rows in the count frequency matrix corresponding to those indexes
count_row1 = frequency_counts[index1:index1+1]
count_row2 = frequency_counts[index2:index2+1]

#obtain the rows in the TF-IDF matrix corresponding to those indexes
tf_idf_row1 = tf_idf_counts[index1:index1+1]
tf_idf_row2 = tf_idf_counts[index2:index2+1]

#similarities
print("Cosine Similarity of term count:",cosine_similarity(count_row1, count_row2)[0][0])
print("Cosine Similarity of TF-IDF:",cosine_similarity(tf_idf_row1, tf_idf_row2)[0][0])
print("Pearson's correlation coefficient using term count:",pearsonr(count_row1.toarray().ravel(),count_row2.toarray().ravel())[0])
print("Pearson's correlation coefficient using TF-IDF:",pearsonr(tf_idf_row1.toarray().ravel(),tf_idf_row2.toarray().ravel())[0])
print("Extended Jaccard Coefficient using term count:", dmh.extended_jaccard_coefficient(count_row1.toarray().ravel(), count_row2.toarray().ravel()))
print("Extended Jaccard Coefficient using TF-IDF:", dmh.extended_jaccard_coefficient(tf_idf_row1.toarray().ravel(), tf_idf_row2.toarray().ravel()))
Sentences:
-Hackneyed writing, certainly, but made even worse by the bad directing.  
-This is definitely one of the bad ones.  
Cosine Similarity of term count: 0.21320071635561041
Cosine Similarity of TF-IDF: 0.10478927051864853
Pearson's correlation coefficient using term count: 0.2117717938823762
Pearson's correlation coefficient using TF-IDF: 0.10337775419421162
Extended Jaccard Coefficient using term count: 0.11764705882352941
Extended Jaccard Coefficient using TF-IDF: 0.05529161949569872

In all of these examples, the similarity computed from the TF-IDF vectors is lower than the similarity computed from the raw word counts: IDF down-weights words that appear in many documents, and it is precisely those common words that the sentences tend to share.
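A minimal sketch of that effect (toy corpus, names illustrative): the two compared sentences share only common words ("the", "bad"), so IDF shrinks exactly the coordinates that contribute to their dot product, while the count representation weights every shared word equally.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["hackneyed writing made worse by the bad directing",
          "this is definitely one of the bad ones",
          "the food was bad", "bad service", "the movie was bad"]

# count-based cosine between the first two sentences
count_sim = cosine_similarity(CountVectorizer().fit_transform(corpus[:2]))[0, 1]

# fit TF-IDF on the whole corpus so "the" and "bad" receive low IDF weights
tfidf = TfidfVectorizer().fit_transform(corpus)
tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print('count cosine: %.3f' % count_sim)
print('tf-idf cosine: %.3f' % tfidf_sim)
```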